gradeAn ongoing topic in all levels of education is the influence of socioeconomic factors on academic performance. This starts from infancy, but could extend through all years of life.
In 2021, the publication “Ongelijkheid in het Nederlandse onderwijs door de jaren heen” (eng., Inequality in Dutch education over the years) highlighted the social inequality in educational achievement. In the process of going through the article, it becomes evident that there exists a substantial difference between students from less privileged and more affluent social backgrounds in their educational achievement. This difference is reflected in both primary and secondary school academic results (Diris, 2022).
As mentioned above, there is a clear distinction between the educational achievements between students from different backgrounds in the Netherlands. Several factors may come into play here.
One factor not considered here is the Dutch school system. The Dutch school system consists of 8 years of elementary school, in which the last year an advice is given for the level of secondary school of the student in question. There has been criticism of this system, including by Boztas (2023): Education or ‘selecting the new elite’?, is the title of his article. Boztas raises concerns about the current school system in this article. That determines what level of secondary education they will follow: VWO, Havo or VMBO comes too early in a child’s development.
The Dutch Education Council agrees, they recommend delaying the moment of selection until after a broad three-year bridge period. This gives students more time for development (Ministerie van Onderwijs, Cultuur en Wetenschap, 2023). Students who come from a lower social background and, for example, need more time to learn the language, but are not less intelligent than his peers.
In Portugal, a comparable system is in place. Elementary school spans nine years, with the final three years comprising lower secondary education. At the age of 15, students transition to secondary education, where a choice will be made between further studies or dual certification, academic and vocational (European Commission, 2023).
The question is whether this system can reduce the difference in educational achievement between students from lower and higher social backgrounds. Or whether in Portugal, akin to the Netherlands, there exists a substantial difference between students from lower and higher social backgrounds in educational achievement. The results of this research will be socially relevant.
Hence, this research will center on the following research question: To what extent do socioeconomic factors, including demographic characteristics, educational context, environmental conditions, parental education and occupation, and educational and social support , impact the academic performance of high school students in Portugal?
To load specific packages in R, we use the function
library() as follows:
library(ggplot2) # for visualizing our data in R
library(naniar) # for handling missing values
library(dplyr ) # for ...
library(reshape2) # for ...
library(psych)
library(GGally) # for the ggcorr() function
library(rpart) # for the CART algorithm
library(rpart.plot) # for plotting the decision trees
library(C50) # for the C5.0 algorithm
library(randomForest) # for the Random Forest algorithm
library(plyr) # for the 'mutate' function
library(liver) # for the find.na
library(forcats) # for the "fct_collapse" function
library(Hmisc) # for replacing missing values
set.seed(1)Education in Portugal is mandatory from the age of 6 up to the age of 18. This system is divided into three cycles, the first cycle consists of 4 years, the second cycle spans 2 years and the last cycle lasts for 3 years as mentioned. The set up for the school curriculum is to make sure that everyone gets a basic education. This form of education gives important knowledge and skills, which can be used for more learning in the future (European Commission, 2023). The focus of this research will be on the last cycle.
The research will be conducted around the academic achievements of students in secondary school, the last cycle, following the course math. Their grades will take a value of 1 to 20, where 10 is equivalent to a passing grade. The inspection will be taken if there will be a correlation/relation between the predicting variables and the outcome variable ‘Final grade’.
To get a better view on the relationship between socioeconomic status and academic performance (in Portugal), this research will explore 5 socioeconomic determinants of grades.
Variables addressed in this section: school,
sex and age.
A worldwide phenomenon, women out-perform men in education. This happens not only in primary school, secondary school or college, but also in higher education. The reasons for this can vary greatly by socioeconomic context (Ullah & Ullah, 2019). The question is whether this conclusion can also be drawn in the context of this research, in Portugal.
As mentioned earlier, secondary school in Portugal, as we know it in the Netherlands, only starts at age 15 and lasts three years. Thus, the dataset of this study will be based mainly on students aged 15 to 18. The students older than this age limit will in all likelihood have failed a class at least once. Perhaps it can be concluded that this group therefore has more difficulty with this subject. This could lead to lower grades than its younger peers.
Hypotheses to be tested:
There are also a number of personal factors that can contribute to academic performance. Factors that may be related to this are study time and the ambition to take higher education. As mentioned earlier, under the heading “Extra school support”, students who are eager to attend a university are likely to put a lot of effort into it. This should be reflected in the weekly study time. After which this effort should be transposed into better grades. For the named factors, the positive link to grades can also be inferred independently.
Which link is directly opposite is student absence and the number of past class failures. The common intuition is that the more often someone is absent, the more material that student misses and thus has less preparation for the exam. Which increases the likelihood of a lower grade. This intuition is confirmed by Leon (2018), that absences negatively affect the grades. As a result, these lower grades increase the likelihood of someone failing the class. In addition, an inference can be drawn from failing a class, that these individuals may have more difficulty in that particular subject. So the probability of lower grades for these students is also higher.
Hypotheses to be tested:
Address is fairly vague at the moment, this can take two values in this research. These values are rural or urban. Which of the two will generally be more intelligent and thus in all likelihood perform better in school? A study from America suggests that people with higher intelligence are more likely to live in an urban area than in a rural area (Hayes, 2014). Will this be the same case for the people in Portugal?
Another influential environmental variable is whether or not someone has access to the Internet. The Internet can be of great use in education. Solving inquiries is much easier using the Internet. Thus, access to the Internet can influence a student’s results. This is also reflected in the studies of Amponsah et al. (2022), where students who had access to the Internet made greater progress in academic performance than without it.
Hypotheses to be tested:
Parents are our first teachers in life, whether it is taking care of yourself, relationships or education. It all starts with our parents. They are the first to give us something to learn in our lives. These first life lessons will not be the same for all families. For example, how someone is raised and what they do or do not inherit from their family can be very different.
What plays an important factor in this is the education of the parents. The link between parental education and academic performance has been studied numerous times. Similarly, Idris et al. (2020) shows in his study that there is a significant difference between the results of students with highly educated parents and the average result. Whereas the pass rate of the students with highly educated parents is 87.67%, it differs significantly from the average which is a pass rate of 72.15%. In his study, he shows that the lower the students’ parents are educated, the lower the student’s success rate. In other words the academic performance of students with highly educated parents are generally considered better. This research aims to test this conclusion with the chosen dataset.
Education of parents is not the only thing that can affect academic performance, the occupation of parents can certainly play a role in this as well. Today, there are countless occupations that a parent can take on. Thus, it will also be nearly impossible to test the influence of each occupation on the academic performance of their children. This research will be focused on a select number of occupations, just as in the study by Owuor, Simatwa, and Ndolo (2022) where it was interpreted that parents with occupations requiring higher education influence academic performance more positively, than the other occupations. This research will examine this relationship between parents with higher education occupations and students’ academic performance. Thus, the following hypotheses will be tested:
Hypotheses to be tested:
To import the CSV data in R , we use the
function read.csv() as follows:
To see the overview of the dataset in R, we use the
function str() as follows:
'data.frame': 395 obs. of 33 variables:
$ school : chr "GP" "GP" "GP" "GP" ...
$ sex : chr "F" "F" "F" "F" ...
$ age : int 18 17 15 15 16 16 16 17 15 15 ...
$ address : chr "U" "U" "U" "U" ...
$ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
$ Pstatus : chr "A" "T" "T" "T" ...
$ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
$ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
$ Mjob : chr "at_home" "at_home" "at_home" "health" ...
$ Fjob : chr "teacher" "other" "other" "services" ...
$ reason : chr "course" "course" "other" "home" ...
$ guardian : chr "mother" "father" "mother" "mother" ...
$ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
$ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
$ failures : int 0 0 3 0 0 0 0 0 0 0 ...
$ schoolsup : chr "yes" "no" "yes" "no" ...
$ famsup : chr "no" "yes" "no" "yes" ...
$ paid : chr "no" "no" "yes" "yes" ...
$ activities: chr "no" "no" "no" "yes" ...
$ nursery : chr "yes" "no" "yes" "yes" ...
$ higher : chr "yes" "yes" "yes" "yes" ...
$ internet : chr "no" "yes" "yes" "yes" ...
$ romantic : chr "no" "no" "no" "yes" ...
$ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
$ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
$ goout : int 4 3 2 2 2 2 4 4 2 1 ...
$ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
$ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
$ health : int 3 3 3 5 5 5 3 1 1 5 ...
$ absences : int 6 4 10 2 4 10 0 6 0 0 ...
$ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
$ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
$ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
The data set currently comprises of 32 predictor variables along with
the target variable grade. To ensure our focus remains on
the most relevant factors, we may need to prune or remove some of the
insignificant variables in order to enhance the efficiency of our
analysis.
To emphasize our focus on the socioeconomic factors influencing
students’ grades, we will eliminate variables that are not relevant to
this investigation. The following variables will be excluded from our
data set: famsize, Pstatus,
reason, guardian, traveltime,
activities, nursery, romantic,
famrel, freetime, goout,
Dalc, Walc, G1,
G2.
To remove certain variables from the data frame
in R, we use the function select() as
follows:
data = select(data, -c('famsize', 'Pstatus', 'reason', 'guardian', 'traveltime', 'activities', 'nursery', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'G1', 'G2'))In order to see if it had an effect we call the
function str() :
'data.frame': 395 obs. of 18 variables:
$ school : chr "GP" "GP" "GP" "GP" ...
$ sex : chr "F" "F" "F" "F" ...
$ age : int 18 17 15 15 16 16 16 17 15 15 ...
$ address : chr "U" "U" "U" "U" ...
$ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
$ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
$ Mjob : chr "at_home" "at_home" "at_home" "health" ...
$ Fjob : chr "teacher" "other" "other" "services" ...
$ studytime: int 2 2 2 3 2 2 2 2 2 2 ...
$ failures : int 0 0 3 0 0 0 0 0 0 0 ...
$ schoolsup: chr "yes" "no" "yes" "no" ...
$ famsup : chr "no" "yes" "no" "yes" ...
$ paid : chr "no" "no" "yes" "yes" ...
$ higher : chr "yes" "yes" "yes" "yes" ...
$ internet : chr "no" "yes" "yes" "yes" ...
$ health : int 3 3 3 5 5 5 3 1 1 5 ...
$ absences : int 6 4 10 2 4 10 0 6 0 0 ...
$ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
The insignificant variables have been removed from our data set. However, a large proportion of our variables have been misinterpreted during the importation of the data frame. Specifically, variables that intended to be categorical have been imported as character data types. Accurate data type assignment is crucial since certain operations are valid only with specific data types.
To convert variable data types from character to factor in the data
frame in R, we use the
function as.factor() as follows:
data$'school' <- as.factor(data$school)
data$'sex' <- as.factor(data$sex)
data$'address' <- as.factor(data$address)
data$'Medu' <- as.factor(data$Medu)
data$'Fedu' <- as.factor(data$Fedu)
data$'Mjob' <- as.factor(data$Mjob)
data$'Fjob' <- as.factor(data$Fjob)
data$'studytime' <- as.factor(data$studytime)
data$'schoolsup' <- as.factor(data$schoolsup)
data$'famsup' <- as.factor(data$famsup)
data$'paid' <- as.factor(data$paid)
data$'higher' <- as.factor(data$higher)
data$'internet' <- as.factor(data$internet)We will also rename some of our variables to make the analysis easier
to preform. To rename variables in the data frame in R,
we use the function names() as follows:
names(data)[5] <- 'medu'
names(data)[6] <- 'fedu'
names(data)[7] <- 'mjob'
names(data)[8] <- 'fjob'
names(data)[18] <- 'grade' Again, in order to see if it had an effect we call the
function str() :
'data.frame': 395 obs. of 18 variables:
$ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
$ age : int 18 17 15 15 16 16 16 17 15 15 ...
$ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
$ medu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
$ fedu : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
$ mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
$ fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
$ studytime: Factor w/ 4 levels "1","2","3","4": 2 2 2 3 2 2 2 2 2 2 ...
$ failures : int 0 0 3 0 0 0 0 0 0 0 ...
$ schoolsup: Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
$ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
$ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
$ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
$ health : int 3 3 3 5 5 5 3 1 1 5 ...
$ absences : int 6 4 10 2 4 10 0 6 0 0 ...
$ grade : int 6 6 10 15 10 15 11 6 19 15 ...
We form a final overview of the variables that we want to use to
tackle our research question. To create an overview
in R, we use the summary() function as
follows:
school sex age address medu fedu mjob
GP:349 F:208 Min. :15.0 R: 88 0: 3 0: 2 at_home : 59
MS: 46 M:187 1st Qu.:16.0 U:307 1: 59 1: 82 health : 34
Median :17.0 2:103 2:115 other :141
Mean :16.7 3: 99 3:100 services:103
3rd Qu.:18.0 4:131 4: 96 teacher : 58
Max. :22.0
fjob studytime failures schoolsup famsup paid
at_home : 20 1:105 Min. :0.0000 no :344 no :153 no :214
health : 18 2:198 1st Qu.:0.0000 yes: 51 yes:242 yes:181
other :217 3: 65 Median :0.0000
services:111 4: 27 Mean :0.3342
teacher : 29 3rd Qu.:0.0000
Max. :3.0000
higher internet health absences grade
no : 20 no : 66 Min. :1.000 Min. : 0.000 Min. : 0.00
yes:375 yes:329 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00
Median :4.000 Median : 4.000 Median :11.00
Mean :3.554 Mean : 5.709 Mean :10.42
3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:14.00
Max. :5.000 Max. :75.000 Max. :20.00
In this section, we assess the presence of any missing values. To
check for the missing values (NA’s), in R, we use the
find.na() function as follows:
[1] " No missing values (NA) in the dataset."
This message indicates that this data frame does not have any missing values (NA’s).
We can also check for the missing values (NA’s),
in R, by plotting the missing values through the
gg_miss_var() function:
Based on the displayed plot above, we can again infer that our data frame is devoid of any NA’s, meaning that there are no missing values in our data set.
Outliers or unusual values are observations that deviate significantly from the overall pattern of the dataset. When sifting through a large volume of data, outliers can often become obscured in visual representations like histograms or two-dimensional scatter plots.
After reviewing the data summary, we have reason to believe that the
numerical variables grade and absences, and
the categorical variables medu and fedu may
potentially include outliers.
grade and absences
To investigate this further, let’s examine the histograms and
two-dimensional scatter plot for the variables grade and
`absences.
ggplot( data = data, aes( x = grade ) ) +
geom_histogram( bins = 30, color = "#283a53", fill = "#d5e0ec") +
labs( title = "Histogram target variable 'grade'" ) +
theme(plot.title = element_text(hjust = 0.5))ggplot( data = data, aes( x = absences ) ) +
geom_histogram( bins = 30, color = "#283a53", fill = "#d5e0ec") +
labs( title = "Histogram variable 'absences'" ) +
theme(plot.title = element_text(hjust = 0.5))gg <- ggplot( data = data, mapping = aes( x = grade, y = absences ) ) +
geom_point( colour = 'blue' ) +
labs( title = "Two-dimensional scatter plot 'grade' versus 'absences'" ) +
theme(plot.title = element_text(hjust = 0.5))
gg <- gg + geom_point(data = data.frame(x = c(0, 8, 9, 11), y = c(0, 56, 75, 54 )), aes(x, y), color = 'red', size = 10, shape = 1)
print(gg)
Based on the above plots, it seems reasonable to infer that the
target variable grade contains outliers. The accumulation
of students with a grade of 0 may indicate that these students were
either absent during the examination or failed to meet the minimum
requirements. Therefore, we will classify the obtained grades of 0 as
outliers. Furthermore, there also appears to be outliers in the variable
absences. These outliers are characterized by an unusually
high number of absences, which deviate from the typical distribution of
variable. Therefore, we will classify any number of absences exceeding
40 as an outlier.
To address the unusual values in the grade and
absences variables, we will begin by substituting these
outliers with missing values. To achieve this, we call the
mutate() function to replace the variables with a
modified copy. The ifelse() function is utilized to
replace the unusual values with NA (missing values).
data = mutate( data, grade = ifelse( grade == 0, NA, grade ) )
data = mutate( data, absences = ifelse( absences > 40, NA, absences ) ) Instead of removing the outliers, we have opted for an approach
called imputation. In this case, we will impute the missing values by
assigning random values in proportion to the records within each
category. To accomplish this, we use the impute()
function as follows:
data $ grade = impute( data $ grade, 'random' )
data $ absences = impute( data $ absences, 'random' )To assess whether the mutation and imputation process had any impact,
we will analyze the histograms, two-dimensional scatter plot and the
imputed summary of the variables grade and
absences after these procedures have been applied:
ggplot( data = data, aes( x = grade ) ) +
geom_histogram( bins = 30, color = "#283a53", fill = "#d5e0ec" ) +
labs( title = "Histogram target variable 'grade'" ) +
theme(plot.title = element_text(hjust = 0.5))ggplot( data = data, aes( x = absences ) ) +
geom_histogram( bins = 30, color = "#283a53", fill = "#d5e0ec" ) +
labs( title = "Histogram variable 'absences'" ) +
theme(plot.title = element_text(hjust = 0.5))ggplot( data = data, mapping = aes( x = grade, y = absences ) ) +
geom_point( colour = 'blue' ) +
labs( title = "Two-dimensional scatter plot 'grade' versus 'absences'" ) +
theme(plot.title = element_text(hjust = 0.5))
Imputed Values:
data$grade
n missing distinct Info Mean Gmd .05 .10
38 0 12 0.979 13.05 3.596 9.85 10.00
.25 .50 .75 .90 .95
10.25 12.00 15.75 18.00 18.15
Value 8.00 8.99 9.98 10.97 11.96 12.95 13.94 14.93 15.92 16.91 17.90
Frequency 1 1 8 8 3 1 2 4 3 2 3
Proportion 0.026 0.026 0.211 0.211 0.079 0.026 0.053 0.105 0.079 0.053 0.079
Value 19.00
Frequency 2
Proportion 0.053
For the frequency table, variable is rounded to the nearest 0.11
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 10.00 11.00 11.67 14.00 20.00
Imputed Values:
[1] 3 1 0
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 4.000 5.251 8.000 40.000
The outliers for the numerical variables grade and
absences appear to have been effectively imputed.
medu and fedu
We might also want to have a look at the categorical variables
medu and fedu. Note that the concept of an
outlier doesn’t apply to categorical data. However, categories with
significantly lower frequencies compared to the others can be considered
as outliers within categorical data. To explore this further, let’s
examine the frequencies by plotting both histograms.
ggplot( data = data ) +
geom_bar( aes( x = medu ), color = "#283a53", fill = c( "#b6d2e4") ) +
labs( title = "Bar plot for the target variable 'churn'" ) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Histogram binary grade classification") 0 1 2 3 4
3 59 103 99 131
ggplot( data = data ) +
geom_bar( aes( x = fedu ), color = "#283a53", fill = c( "#b6d2e4") ) +
labs( title = "Bar plot for the target variable 'churn'" ) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Histogram binary grade classification") 0 1 2 3 4
2 82 115 100 96
Based on the above histograms, it seems reasonable to infer that the
variables medu and fedu contain outliers. The
category 0, representing individuals with no formal education, contains
significantly fewer observations comared to the other categories. Hence,
we will identify parents with no formal education as outliers.
To address the unusual values in the medu and fedu variables, we will
start by replacing these outliers with missing values (NA’s).
Subsequently, rather than removing these outliers, we have chosen to
impute them using the mode. To accomplish this, we use the
is.na() function as follows:
data $ medu[data $ medu == 0] <- NA
data $ medu = impute( data $ medu, 4 )
data $ medu <- droplevels(data$medu)
data $ fedu[data $ fedu == 0] <- NA
data $ fedu = impute( data $ fedu, 2 )
data $ fedu <- droplevels(data $ fedu) To assess whether the mutation and imputation process had any impact, we will analyze the histograms and the imputed summary of the variables medu and fedu after these procedures have been applied:
ggplot( data = data ) +
geom_bar( aes( x = medu ), color = "#283a53", fill = c( "#b6d2e4") ) +
labs( title = "Bar plot for the target variable 'churn'" ) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Histogram binary grade classification") 1 2 3 4
59 103 99 134
ggplot( data = data ) +
geom_bar( aes( x = fedu ), color = "#283a53", fill = c( "#b6d2e4") ) +
labs( title = "Bar plot for the target variable 'churn'" ) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Histogram binary grade classification") 1 2 3 4
82 117 100 96
The outliers for the categorical variables medu and
fedu appear to have been effectively imputed.
The data was gathered during the 2005-2006 school year from two
public schools (Gabriel Pereira and Mousinho da Silveira), both situated
in the Alentejo region of Portugal. The data was collected as part of a
comprehensive survey conducted among students enrolled in mathematics
courses at the secondary school level. Our modified dataset contains 395
observations for 16 predictors along with the target variable
grade. The variables are described as follows:
school: students’ school (Gabriel Pereira or
Mousinho da Silveira)
sex: students’ sex (female or male)
age: students’ age in years (ranging from 15 to
22)
address: students’ home address type (urban or
rural)
medu: mother’s education (scale 0 to 4. 0: none, 1:
primary education, 2: 5th to 9th grade, 3: secondary education and 4:
higher education)
fedu: father’s education (scale 0 to 4. 0: none, 1:
primary education, 2: 5th to 9th grade, 3: secondary education and 4:
higher education)
mjob: mother’s job (teacher, health care related,
civil services, at home or other)
fjob: father’s job (teacher, health care related,
civil services, at home or other)
studytime: weekly study time (scale 1 to 4). 1: <
2 hours, 2: 2 to 5 hours, 3: 5 to 10 hours or 4: > 10 hours)
failures: number of past class failures (ranging
from 0 to 3)
schoolsup: extra educational school support (yes or
no)
famsup: family educational support (yes or
no)
paid: extra paid classes (yes or no)
higher: wants to take higher education (yes or
no)
internet: internet access at home (yes or
no)
absences: number of school absences (ranging from 0
to 40)
grade: final grade (ranging from 0 to 20)
The variable types can be categorized as follows:
| Variable | R datatype | Statistical datatype |
|---|---|---|
school |
factor |
categorical - binary |
sex |
factor |
categorical - binary |
age |
int |
numerical - discrete |
address |
factor |
categorical - binary |
medu |
int |
numerical - discrete |
fedu |
int |
numerical - discrete |
mjob |
factor |
categorical - nominal |
fjob |
factor |
categorical - nominal |
studytime |
int |
numerical - discrete |
failures |
int |
numerical - discrete |
schoolsup |
factor |
categorical - binary |
famsup |
factor |
categorical - binary |
paid |
factor |
categorical - binary |
higher |
factor |
categorical - binary |
internet |
factor |
categorical - binary |
absences |
int |
numerical - discrete |
grade |
int |
numerical - discrete |
In order to further our understanding about grade and
the variables affecting it, we will use graphs, plots, and tables to
uncover relationships that are relevant for further investigation. We
first start by looking at the general distribution of the grades
obtained by high school students in Portugal.
gradeIn Portugal, high schools and higher education institutions use a 20-point grading scale. with the minimum passing grade being a 10. This means that students need to score a 10 or higher to pass the Mathematics course. First, we generate a simple overview of the range of the grades obtained by the students.
grade_summary <- capture.output(summary(data$grade))
cleaned_summary <- tail(grade_summary, 2)
cat(cleaned_summary, sep = "\n") Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 10.00 11.00 11.67 14.00 20.00
We will create a density plot to offer a more in-depth view of the distribution of final grades. Density plots are a valuable tool in data visualization for display the distribution of a continuous variables.
ggplot(data = data) +
geom_density(aes(x = grade), color = "#283a53", fill = "#d5e0ec", alpha = 0.8) +
ggtitle("General distribution of the final grades") +
xlab("Final grades") + ylab("Density") +
theme(plot.title = element_text(hjust = 0.5))The density plot of the target variable grade shown
above exhibits a distribution that closely resembles a normal
distribution.
In this part, the grades obtained by the students we be categorized
into a binary classification. This categorization allows us to create
plots where the target variable is binary. We will categorize the grades
into the binary classification by calling the function
cut() as follows:
cutgrade2 <- cut(data $ grade, breaks=c(0, 9, 20), labels = c('Fail', 'Pass'), include.lowest = TRUE) Binary classification: a student passed the course if the final grade is greater than or equal to 10; otherwise they failed.
Fail: grades 0 to 9
Pass: grades 10 to 20
ggplot( data = data ) +
geom_bar( aes( x = cutgrade2 ), color = "#283a53", fill = c( "#b6d2e4") ) +
labs( title = "Bar plot for the target variable 'churn'" ) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Histogram binary grade classification") +
xlab("classification") + ylab("count") Out of the total number of students, 76.2% (301 students) successfully passed the Mathematics course, while 23.8% (94 students) did not meet the passing requirements for the course.
school
In this section we investigate the relationship between the variable
school and the target variable grade.
GP MS
349 46
Out of the 395 students, 88.4% (349 students) went to Gabriel Pereira and 11.6% (46 students) went to Mousinho da Silveira.
ggplot( data = data ) +
geom_boxplot( aes( x = school, y = grade ), fill = c("#b6d2e4", "#f8a5a7") ) +
scale_x_discrete(labels= c('Gabriel Pereira', 'Mousinho da Silveira')) +
xlab("Students' school") + ylab("Final grades") +
ggtitle("Boxplot variable school") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data, aes( x = school, fill = cutgrade2 ) ) +
geom_bar( position = "fill", color = "#283a53") +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Proportional histogram variable age") The above plots do not provide compelling graphical evidence of the
predictive importance of the variable school. It appears
that the grade distributions at the two schools are approximately equal.
To investigate this further, we will conduct a hypothesis test to assess
the potential dependency between the variables school and
grade.
Welch Two Sample t-test
data: grade by school
t = 1.8636, df = 60.222, p-value = 0.06726
alternative hypothesis: true difference in means between group GP and group MS is not equal to 0
95 percent confidence interval:
-0.06464906 1.82931831
sample estimates:
mean in group GP mean in group MS
11.77364 10.89130
We do not reject the null-hypothesis since the p-value >𝛼=0.05.
Thus, at a significance level of 5%, we have found insufficient evidence
to infer that the mean final grade distribution between Gabriel Pereira
(GP) and Mousinho da Silveira (MS) are different. Based on the
two-sample T-test there doesn’t not seem to be a significant
relationship between the variable school and the target
variable grade.
sex
In this section we investigate the relationship between the variable
sex and the target variable grade.
F M
208 187
Out of the 395 students, 52.7% (208 students) are female and 47.3% (187 students) are male.
ggplot( data = data ) +
geom_boxplot( aes( x = sex, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
scale_x_discrete(labels= c('Female', 'Male')) +
xlab("Students' school") + ylab("Final grades") +
ggtitle("Boxplot variable sex") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data ) +
geom_density( aes( x = grade, fill = sex ), alpha = 0.3 ) +
ggtitle("Density plot variable sex") +
theme(plot.title = element_text(hjust = 0.5)) The above plots do not provide compelling graphical evidence of the
predictive importance of the variable sex. It appears that
the grade distributions between females and males are approximately
equal. To investigate this further, we will conduct a hypothesis test to
assess the potential dependency between the variables sex
and grade.
Welch Two Sample t-test
data: grade by sex
t = -1.1663, df = 390.46, p-value = 0.2442
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
-1.0239690 0.2614228
sample estimates:
mean in group F mean in group M
11.49038 11.87166
We do not reject the null-hypothesis since the p-value >𝛼=0.05.
Thus, at a significance level of 5%, we have found insufficient evidence
to infer that the mean final grades for male and female students are
different. Based on the two-sample T-test there does not seems to be a
significant relationship between the variable sex and the
target variable grade. Hence, H1 will be
rejected.
H2h: Students of the sex Male perform academically better than their female peers has been rejected!
age*
In this section we investigate the relationship between the variable
age and the target variable grade.
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.0 16.0 17.0 16.7 18.0 22.0
ggplot( data = data ) +
geom_bar( aes( x = age, fill = cutgrade2 ), color = "#283a53" ) +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Histogram variable age") ggplot( data = data, aes( x = age, fill = cutgrade2 ) ) +
geom_bar( position = "fill", color = "#283a53") +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Proportional histogram variable age") ggplot(data=data, aes(x=age, y=grade))+
geom_point(color="#87adc4") +
geom_smooth(method = "lm", se = FALSE) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Scatterplot grade versus age") The above plots indicate graphical evidence of the predictive
importance of the variable age. It seems to be the case
that older students fail classes more often than than younger
classmates. To investigate this further, we will conduct a hypothesis
test to assess the potential dependency between the variables
age and grade.
Pearson's product-moment correlation
data: data$grade and data$age
t = -2.6252, df = 393, p-value = 0.004499
alternative hypothesis: true correlation is less than 0
95 percent confidence interval:
-1.00000000 -0.04892303
sample estimates:
cor
-0.1312777
We reject the null-hypothesis since the p-value<𝛼=0.05. Thus, at a significance level of 5%, we have found sufficient evidence to infer that there exists a significant negative relationship between the age of students and their academic grades. Hence, H2 will be accepted.
studytime*
In this section we investigate the relationship between the variable
studytime and the target variable grade.
Studytime is a categorical variable representing the weekly study time
(scale 1 to 4).
1 2 3 4
105 198 65 27
The scales represent the following:
1: represents students that studied 2 hours or less a week
2: represents students that studied between 2 to 5 hours a week
3: represents students that studied between 5 to 10 hours a week
4: represents students that studied more than 10 hours a week
ggplot( data = data ) +
geom_bar( aes( x = studytime, fill = cutgrade2 ), color = "#283a53" ) +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
ggtitle("Histogram variable studytime") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data, aes( x = studytime, fill = cutgrade2 ) ) +
geom_bar( position = "fill", color = "#283a53") +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
ggtitle("Proportional histogram variable studytime") +
theme(plot.title = element_text(hjust = 0.5)) The above plots indicate graphical evidence of the predictive
importance of the variable studytime. It appears to be the
case that students who dedicate more time to their studies tend to
achieve higher grades compared to those who study for shorter durations.
To investigate this further, we will conduct a hypothesis test to assess
the potential dependency between the variables studytime
and grade.
Df Sum Sq Mean Sq F value Pr(>F)
studytime 3 134 44.6 4.33 0.0051 **
Residuals 391 4027 10.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null-hypothesis since the p-value \(< \alpha=0.05\). Therefore, at a
significance level of 5%, we have found sufficient evidence to infer
that there seems to be a significant positive relationship between the
variable studytime and the target variable
grade.Hence, H3 will be accepted.
failures*
In this section we investigate the relationship between the variable
failures and the target variable grade.
0 1 2 3
312 50 17 16
ggplot( data = data ) +
geom_bar( aes( x = failures, fill = cutgrade2 ), color = "#283a53" ) +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
ggtitle("Histogram variable failures") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data, aes( x = failures, fill = cutgrade2 ) ) +
geom_bar( position = "fill", color = "#283a53") +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
ggtitle("Proportional histogram variable failures") +
theme(plot.title = element_text(hjust = 0.5)) ggplot(data=data, aes(x=failures, y=grade))+
geom_point(color="#87adc4") +
geom_smooth(method = "lm", se = FALSE) +
theme_classic() +
ggtitle("Scatterplot grade versus failures") +
theme(plot.title = element_text(hjust = 0.5)) The above plots indicate graphical evidence of the predictive
importance of the variable failures. It appears to be the
case that students who have failed their class tend to achieve lower
grades compared to those who have not failed their classes. To
investigate this further, we will conduct a correlation test to assess
the potential dependency between the variables failures and
grade.
Pearson's product-moment correlation
data: data$grade and data$failures
t = -4.6299, df = 393, p-value = 2.489e-06
alternative hypothesis: true correlation is less than 0
95 percent confidence interval:
-1.0000000 -0.1473185
sample estimates:
cor
-0.2274285
We reject the null-hypothesis since the p-value<𝛼=0.05. Thus, at a
significance level of 5%, we have found sufficient evidence to infer
that there exists an inverse relationship between the variables
failures and grade. In other words, an
increase in the number of past class failures correspond to lower
academic performance. Hence, H4 will be accepted.
absences*
In this section we investigate the relationship between the variable
absences and the target variable grade.
absences_summary <- capture.output(summary(data$absences))
cleaned_summary <- tail(absences_summary, 2)
cat(cleaned_summary, sep = "\n") Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 4.000 5.251 8.000 40.000
ggplot( data = data ) +
geom_density( aes( x = absences, fill = cutgrade2 ), alpha = 0.3 ) +
ggtitle("Density plot variable absences") +
theme(plot.title = element_text(hjust = 0.5)) ggplot(data=data, aes(x=absences, y=grade))+
geom_point(color="#87adc4") +
geom_smooth(method = "lm", se = FALSE) +
theme_classic() +
ggtitle("Scatterplot grade versus absences") +
theme(plot.title = element_text(hjust = 0.5)) The above plots indicate graphical evidence of the predictive
importance of the variable absences. It seems to be the
case that students that were absent more than 10 times score way lower
grades compared to those that were absent less than 10 times. To
investigate this further, we will conduct a hypothesis test to assess
the potential dependency between the variables absences and
grade.
Pearson's product-moment correlation
data: data$grade and data$absences
t = -4.8196, df = 393, p-value = 1.029e-06
alternative hypothesis: true correlation is less than 0
95 percent confidence interval:
-1.00000 -0.15641
sample estimates:
cor
-0.2362345
We reject the null-hypothesis since the p-value<𝛼=0.05. Thus, at a
significance level of 5%, we have found sufficient evidence to infer
that there exists an inverse relationship between the variables
absences and grade. Hence, H5
will be accepted.
higher*
In this section we investigate the relationship between the variable
higher and the target variable grade.
ggplot( data = data ) +
geom_bar( aes( x = higher, fill = cutgrade2 ), color = "#283a53" ) +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
ggtitle("Histogram variable higher") +
theme(plot.title = element_text(hjust = 0.5)) Out of the 395 students, 94.9% (375 students) of the students want to pursue higher education and 5.1% (20 students) are not planning to pursue higher education.
ggplot( data = data ) +
geom_boxplot( aes( x = higher, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
scale_x_discrete(labels= c('no', 'yes')) +
xlab("pursue higher education") + ylab("final grades") +
ggtitle("Boxplot variable internet") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data ) +
geom_density( aes( x = grade, fill = higher ), alpha = 0.3 ) +
ggtitle("Density plot variable higher") +
theme(plot.title = element_text(hjust = 0.5)) The above plots indicate graphical evidence of the predictive
importance of the variable higher. It appears that students
who aspire to pursue higher education tend to achieve higher grades
compared to those who do not. To investigate this further, we will
conduct a hypothesis test to assess the potential dependency between the
variables higher and grade.
Welch Two Sample t-test
data: grade by higher
t = -4.0384, df = 26.974, p-value = 0.0002002
alternative hypothesis: true difference in means between group no and group yes is less than 0
95 percent confidence interval:
-Inf -0.9872031
sample estimates:
mean in group no mean in group yes
10.05000 11.75733
We reject the null-hypothesis since the p-value <𝛼=0.05. Thus, at
a significance level of 5%, we have found insufficient evidence to infer
that the mean final grade of students with aspirations for higher
education is higher than those who do not have such aspirations. Based
on the two-sample T-test there seems to be a significant positive
relationship between the variable higher and the target
variable grade. Hence, H6 will be
accepted.
internet*
In this section we investigate the relationship between the variable
internet and the target variable grade.
ggplot( data = data ) +
geom_bar( aes( x = internet, fill = cutgrade2 ), color = "#283a53" ) +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
ggtitle("Histogram variable internet") +
theme(plot.title = element_text(hjust = 0.5)) Out of the 395 students, 83.3% (329 students) have access to internet and 16.7% (66 students) do not have access to internet.
ggplot( data = data ) +
geom_boxplot( aes( x = internet, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
scale_x_discrete(labels= c('no access', 'access')) +
xlab("access to internet") + ylab("final grades") +
ggtitle("Boxplot variable internet") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data ) +
geom_density( aes( x = grade, fill = internet ), alpha = 0.3 ) +
ggtitle("Density plot variable internet") +
theme(plot.title = element_text(hjust = 0.5)) The above plots indicate slight graphical evidence of the predictive
importance of the variable internet. It appears to be the
case that students who have do not have access to internet score lower
grades compared to those students that do have access to internet. To
investigate this further, we will conduct a correlation test to assess
the potential dependency between the variables internet and
grade.
Welch Two Sample t-test
data: grade by internet
t = -1.9169, df = 96.666, p-value = 0.0291
alternative hypothesis: true difference in means between group no and group yes is less than 0
95 percent confidence interval:
-Inf -0.107632
sample estimates:
mean in group no mean in group yes
11.00000 11.80547
We do not reject the null-hypothesis since the p-value <𝛼=0.05.
Thus, at a significance level of 5%, we have found sufficient evidence
to infer that students with access to internet tend to achieve higher
grades than those without access to internet. Based on the two-sample
T-test there seems to be a significant positive relationship between the
variable internet and the target variable
grade. Hence, H7 will be accepted. Note,
however, that the two-sided hypothesis test for the variable
internet results in insignificance at a 5% significance
level.
address*
In this section we investigate the relationship between the variable
address and the target variable grade.
ggplot( data = data ) +
geom_bar( aes( x = address, fill = cutgrade2 ), color = "#283a53" ) +
scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) +
ggtitle("Histogram variable address") +
theme(plot.title = element_text(hjust = 0.5)) Out of the 395 students, 77.7% (307 students) of the students live in urban areas and 22.3% (88 students) live in rural areas.
ggplot( data = data ) +
geom_boxplot( aes( x = address, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
scale_x_discrete(labels= c('Rural (R)', 'Urban (U)')) +
xlab("Address type") + ylab("final grades") +
ggtitle("Boxplot variable address") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data ) +
geom_density( aes( x = grade, fill = address ), alpha = 0.3 ) +
ggtitle("Density plot variable address") +
theme(plot.title = element_text(hjust = 0.5)) The above plots indicate graphical evidence of the predictive
importance of the variable addres. It appears to be the
case that students who live in urban areas score higher grades compared
to students that live in rural areas. To investigate this further, we
will conduct a correlation test to assess the potential dependency
between the variables addres and grade.
Welch Two Sample t-test
data: grade by address
t = -2.4924, df = 145.67, p-value = 0.006904
alternative hypothesis: true difference in means between group R and group U is less than 0
95 percent confidence interval:
-Inf -0.3193428
sample estimates:
mean in group R mean in group U
10.93182 11.88274
We reject the null-hypothesis since the p-value <𝛼=0.05. Thus, at
a significance level of 5%, we have found sufficient evidence to infer
that students that reside in an urban area perform academically better
than those residing in a rural area.. Based on the two-sample T-test
there seems to be a significant relationship between the variable
addres and the target variable grade. Hence,
H8 will be accepted.
medu*
In this section we investigate the relationship between the variable
meduand the target variable grade. Medu is are
categorical variables representing the highest attained education level
of the student (scale 1 to 4).
1 2 3 4
59 103 99 134
The scales represent the following:
1: finished primary education
2: finished 5th to 9th grade of primary education
3: finished secondary education
4: finished higher education
ggplot( data = data, aes( x = medu, y = grade )) +
geom_boxplot( fill = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28') ) +
labs( x = "mothers' education", y = "final grade" ) +
ggtitle("Boxplot variable medu") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data ) +
geom_density( aes( x = grade, fill = medu ), alpha = 0.3, ) +
scale_fill_manual(values = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28')) +
ggtitle("Density plot variable medu") +
theme(plot.title = element_text(hjust = 0.5)) The above plots indicate graphical evidence of the predictive
importance of the variable medu. It appears to be the case
that students who’s mother have had a higher education score higher
grades compared to students whose mother have had a lower education. To
investigate this further, we will conduct an ANOVA test to assess the
potential dependency between the variables medu and
grade.
Df Sum Sq Mean Sq F value Pr(>F)
medu 3 126 41.93 4.063 0.00732 **
Residuals 391 4035 10.32
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null-hypothesis since the p-value <𝛼=0.05
.Therefore, at a significance level of 5%, we have found sufficient
evidence to infer that there sems to be a significant positive
relationship between the variable medu and the target
variable grade.
fedu*
In this section we investigate the relationship between the variable
feduand the target variable grade. Fedu is a
categorical variable representing the highest attained education level
of the father of the student (scale 1 to 4).
1 2 3 4
82 117 100 96
The scales represent the following:
1: finished primary education
2: finished 5th to 9th grade of primary education
3: finished secondary education
4: finished higher education
ggplot( data = data, aes( x = fedu, y = grade )) +
geom_boxplot( fill = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28') ) +
labs( x = "fathers' education", y = "final grade" ) +
ggtitle("Boxplot variable fedu") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data ) +
geom_density( aes( x = grade, fill = fedu ), alpha = 0.3 ) +
ggtitle("Density plot variable fedu") +
scale_fill_manual(values = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28')) +
theme(plot.title = element_text(hjust = 0.5)) It again appears to be the case that students who’s father have had a
higher education score higher grades compared to students whose father
have had a lower education. To investigate this further, we will conduct
an ANOVA test to assess the potential dependency between the variables
fedu and grade.
Df Sum Sq Mean Sq F value Pr(>F)
fedu 3 118 39.29 3.799 0.0105 *
Residuals 391 4043 10.34
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null-hypothesis since the p-value <𝛼=0.05
.Therefore, at a significance level of 5%, we have found sufficient
evidence to infer that there seems to be a significant positive
relationship between the variable fedu and the target
variable grade.
Students with parents having higher levels of education tend to achieve higher academic performance compared to students with parents having lower levels of education. Hence, H9 will be accepted.
mjob*
In this section we investigate the relationship between the variable
mjoband the target variable grade. Mjob is a
nominal variable representing the type of job the mother is doing.
at_home health other services teacher
59 34 141 103 58
The type of jobs are the following:
Teacher (higher-educated occupation)
Health care related (higher-educated occupation)
Civil services
At home
Other
ggplot( data = data, aes( x = mjob, y = grade )) +
geom_boxplot( fill = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28', 'green') ) +
labs( x = "mothers' occupation", y = "final grade" ) +
ggtitle("Boxplot variable mjob") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data ) +
geom_density( aes( x = grade, fill = mjob ), alpha = 0.3 ) +
scale_fill_manual(values = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28', 'green')) +
ggtitle("Density plot variable mjob") +
theme(plot.title = element_text(hjust = 0.5)) The above plots indicate graphical evidence of the predictive
importance of the variable mjob. It seems to be the case
that student’s who’s mother are working in the health care score
significantly higher compared to students whose parents are not working
in healthcare. To investigate this further, we will conduct a single
linear regression to assess the potential dependency between the
variables mjob and the target variable
grade.
Call:
lm(formula = grade ~ mjob, data = data)
Residuals:
Min 1Q Median 3Q Max
-7.1986 -2.1471 -0.1986 2.0840 7.8136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.18644 0.41834 26.740 < 2e-16 ***
mjobhealth 1.96062 0.69189 2.834 0.00484 **
mjobother 0.01214 0.49824 0.024 0.98057
mjobservices 0.85239 0.52465 1.625 0.10504
mjobteacher 0.60666 0.59417 1.021 0.30788
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.213 on 390 degrees of freedom
Multiple R-squared: 0.03225, Adjusted R-squared: 0.02232
F-statistic: 3.249 on 4 and 390 DF, p-value: 0.0122
When testing this through a single variable linear regression, we see
that students whose mother work in health care has an effect at a
significance level of 5% since the p-value < 0.05. Both the plots and
linear model suggest that mjob is a good predictor for the
target variable grade.
fjob
In this section we investigate the relationship between the variable
fjoband the target variable grade. Fjob is a
nominal variable representing the type of job the father is doing of the
student.
at_home health other services teacher
20 18 217 111 29
The type of jobs are the following:
Teacher (higher-educated occupation)
Health care related (higher-educated occupation)
Civil services
At home
Other
ggplot( data = data, aes( x = fjob, y = grade )) +
geom_boxplot( fill = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28', 'green') ) +
labs( x = "fathers' occupation", y = "final grade" ) +
ggtitle("Boxplot variable fjob") +
theme(plot.title = element_text(hjust = 0.5)) ggplot( data = data ) +
geom_density( aes( x = grade, fill = fjob ), alpha = 0.3 ) +
ggtitle("Density plot variable fjob") +
scale_fill_manual(values = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28', 'green')) +
theme(plot.title = element_text(hjust = 0.5)) The above plots indicate graphical evidence of the predictive
importance of the variable fjob. It seems to be the case
that student’s who’s father are working as a teacher score significantly
higher compared to students whose father is not working as a teaccher To
investigate this further, we will conduct an ANOVA test to assess the
potential dependency between the variables fjob and the
target variable grade.
Call:
lm(formula = grade ~ fjob, data = data)
Residuals:
Min 1Q Median 3Q Max
-7.5069 -2.0474 -0.5069 2.4931 8.4505
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.7500 0.7226 16.262 <2e-16 ***
fjobhealth -0.1389 1.0499 -0.132 0.8948
fjobother -0.2431 0.7551 -0.322 0.7477
fjobservices -0.2005 0.7850 -0.255 0.7986
fjobteacher 1.5948 0.9392 1.698 0.0903 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.231 on 390 degrees of freedom
Multiple R-squared: 0.02137, Adjusted R-squared: 0.01133
F-statistic: 2.129 on 4 and 390 DF, p-value: 0.07656
When testing this through a single variable linear regression, we see
that all categories within the variable are not significant at a
significance level of 5% since the p-value is > 0.05. The plots and
the linear regression model suggest that fjob is not a good
predictor for the target variable grade.
Students with parents having higher-educated occupations (teacher and health-care) tend to achieve higher academic performance compared to students with parents having lower-educated occupations (civil-service, at home and other). Hence, H10 will be accepted.
In this section we prepare our data frame for modeling. We partition
our data set randomly into two groups: a train set (80%) and a test set
(20%). We use the partition() ` function from the
liver package as follows:
set.seed(1)
data_sets = partition( data = data, prob = c( 0.8, 0.2 ) )
train_set = data_sets $ part1
test_set = data_sets $ part2
actual_test = test_set $ gradeNote that here we are using the set.seed() function in
this particular context to establish the reproducibility of our
results.
To validate the effectiveness of the partitioning process, we will
compare the means of our target variable grade between the
train set and test set. To assess whether the population means in the
grades obtained by students differs between the two data sets, we will
conduct a Two-sample T-test for the difference in means. The hypotheses
are as follows:
\[ \bigg\{ \begin{matrix} H_0: \mu_1 = \mu_2 \\ H_1: \mu_1 \neq \mu_2 \end{matrix} \]
To run the Two-sample T-test, we use the t.test()
function as follows:
Two Sample t-test
data: train_set$grade and test_set$grade
t = 0.2525, df = 393, p-value = 0.8008
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7266068 0.9407451
sample estimates:
mean of x mean of y
11.69040 11.58333
We do not reject the null-hypothesis since the p-value \(> \alpha=0.05\). Therefore, at a significance level of 5%, we have found insufficient evidence to infer that the mean difference in grades obtained by students are different between the train set and test set.
To further validate the partition we test whether the variance in the grades obtained by students differs between the two data sets. To assess whether the variance in the grades obtained by students differs between the two data sets, we will conduct a F-test for the difference in variances. The hypotheses are as follows:
\[ \bigg\{ \begin{matrix} H_0: \sigma_1 = \sigma_2 \\ H_1: \sigma_1 \neq \sigma_2 \end{matrix} \]
To run the F-test, we use the var.test() function as
follows:
F test to compare two variances
data: train_set$grade and test_set$grade
F = 1.2556, num df = 322, denom df = 71, p-value = 0.2476
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8524131 1.7714166
sample estimates:
ratio of variances
1.255587
We again do not reject the null-hypothesis since the p-value \(> \alpha=0.05\). Therefore, at a significance level of 5%, we have found insufficient evidence to infer that the variance in grades obtained by students are different between the train set and test set.
The results of both the Two-sample T-test for the difference in means
and the F-test for the difference in variances provide significant
evidence that the partition for the target variable grade
has been validated.
In “Exploratory Data Analysis (EDA)” the 16 predicting variables of
the data frame were inspected. The EDA indicates that the following
predictors might be significant in drawing conclusion about the effect
of socioeconomic status on the academic performance of students:
age, address, fedu,
medu, mjob, studytime,
failures, schoolsup, paid,
higher and absences.
With the mentioned variables the hypotheses will be tested with different type of Machine Learning algorithms. This will be done with the following formula:
formula_significant = grade ~ (age + address + medu + fedu + mjob + studytime + failures + schoolsup + paid + higher + absences)The multiple linear regression will be created using the lm function. This will give the following equation:
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3x_3 + b_4x_4 + b_5x_5 + b_6x_6 + b_7x_7 + b_8x_8 + b_9x_9 + b_{10}x_{10} + b_{11}x_{11} \]
To explore the linear regression results, run the
summary() command with the name of the specified model.
Call:
lm(formula = formula_significant, data = train_set)
Residuals:
Min 1Q Median 3Q Max
-6.7756 -2.1003 -0.0619 1.8515 7.3738
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.31428 2.86590 4.297 2.34e-05 ***
age -0.09833 0.15140 -0.649 0.516546
addressU 0.81030 0.42413 1.911 0.057013 .
medu2 0.51107 0.57726 0.885 0.376673
medu3 0.49407 0.64745 0.763 0.446002
medu4 0.98941 0.76251 1.298 0.195425
fedu2 0.37074 0.50154 0.739 0.460351
fedu3 0.11499 0.56842 0.202 0.839813
fedu4 0.72620 0.63757 1.139 0.255600
mjobhealth 0.73330 0.80956 0.906 0.365760
mjobother -0.50101 0.53895 -0.930 0.353322
mjobservices 0.37116 0.59728 0.621 0.534793
mjobteacher -0.85252 0.77903 -1.094 0.274679
studytime2 -0.32565 0.41426 -0.786 0.432416
studytime3 0.86778 0.54233 1.600 0.110624
studytime4 1.66591 0.78228 2.130 0.034017 *
failures -0.59632 0.24676 -2.417 0.016259 *
schoolsupyes -2.53991 0.50129 -5.067 7.06e-07 ***
paidyes -0.65197 0.35384 -1.843 0.066371 .
higheryes 1.00085 0.78596 1.273 0.203852
absences -0.10606 0.02820 -3.761 0.000203 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.965 on 302 degrees of freedom
Multiple R-squared: 0.2497, Adjusted R-squared: 0.2
F-statistic: 5.025 on 20 and 302 DF, p-value: 9.99e-11
As the results above shows are schoolsup and
absences significant at \(\alpha=0.001\). Failures and
studytime4 are significant at \(\alpha=0.05\). The following predictors are
all not significant at \(\alpha=0.05\):
age, address, fedu,
medu, mjob, studytime (apart from
studytime4), paid and higher. In the model
above, the residual standard error is approximately 2.965. This means
that the linear regression model, with all significant predictors,
predicts the grade of the student with an average error of 2.965. The
adjusted R-squared is approximately 0.2, indicating that about 20.00% of
the variance in the target variable grade is explained by
the predicting variables included in the model.
With stepwise linear regression it is possible to add or remove
predicting variables based on their significance. This will be done to
identify the most relevant variables for predicting the target variable.
The function that will be used is step() with direction
both, so that predicting variables will be added or removed from the
regression.
The results of the step-wise linear regression can be looked at
calling the summary function:
Start: AIC=722.47
grade ~ (age + address + medu + fedu + mjob + studytime + failures +
schoolsup + paid + higher + absences)
Df Sum of Sq RSS AIC
- medu 3 15.639 2671.1 718.36
- fedu 3 17.421 2672.8 718.58
- age 1 3.709 2659.1 720.92
- higher 1 14.258 2669.7 722.20
<none> 2655.4 722.47
- mjob 4 79.455 2734.9 723.99
- paid 1 29.852 2685.3 724.08
- address 1 32.094 2687.5 724.35
- failures 1 51.349 2706.8 726.65
- studytime 3 103.019 2758.4 728.76
- absences 1 124.356 2779.8 735.25
- schoolsup 1 225.729 2881.2 746.82
Step: AIC=718.36
grade ~ age + address + fedu + mjob + studytime + failures +
schoolsup + paid + higher + absences
Df Sum of Sq RSS AIC
- age 1 3.709 2674.8 716.81
- fedu 3 38.809 2709.9 717.02
- higher 1 12.347 2683.4 717.85
<none> 2671.1 718.36
- paid 1 26.866 2697.9 719.60
- mjob 4 83.576 2754.6 720.31
- address 1 36.696 2707.8 720.77
+ medu 3 15.639 2655.4 722.47
- failures 1 57.641 2728.7 723.26
- studytime 3 112.081 2783.1 725.64
- absences 1 116.670 2787.7 730.17
- schoolsup 1 227.021 2898.1 742.71
Step: AIC=716.81
grade ~ address + fedu + mjob + studytime + failures + schoolsup +
paid + higher + absences
Df Sum of Sq RSS AIC
- fedu 3 42.519 2717.3 715.91
- higher 1 15.934 2690.7 716.73
<none> 2674.8 716.81
- paid 1 26.853 2701.6 718.04
+ age 1 3.709 2671.1 718.36
- mjob 4 84.726 2759.5 718.88
- address 1 40.381 2715.2 719.65
+ medu 3 15.640 2659.1 720.92
- failures 1 62.605 2737.4 722.28
- studytime 3 112.296 2787.1 724.10
- absences 1 135.649 2810.4 730.79
- schoolsup 1 227.262 2902.0 741.15
Step: AIC=715.91
grade ~ address + mjob + studytime + failures + schoolsup + paid +
higher + absences
Df Sum of Sq RSS AIC
<none> 2717.3 715.91
- higher 1 20.191 2737.5 716.30
+ fedu 3 42.519 2674.8 716.81
+ age 1 7.420 2709.9 717.02
+ medu 3 38.665 2678.6 717.28
- paid 1 29.038 2746.3 717.34
- address 1 42.008 2759.3 718.86
- mjob 4 108.092 2825.4 720.51
- studytime 3 105.564 2822.9 722.22
- failures 1 81.586 2798.9 723.46
- absences 1 137.244 2854.5 729.82
- schoolsup 1 221.817 2939.1 739.25
Call:
lm(formula = grade ~ address + mjob + studytime + failures +
schoolsup + paid + higher + absences, data = train_set)
Residuals:
Min 1Q Median 3Q Max
-6.7806 -2.1537 0.0297* 1.9296 7.5318
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.056469 0.864954 12.783 < 2e-16 ***
addressU 0.909050 0.415923 2.186 0.02959 *
mjobhealth 1.436078 0.709903 2.023 0.04394 *
mjobother -0.342567 0.512123 -0.669 0.50405
mjobservices 0.789537 0.541137 1.459 0.14557
mjobteacher 0.004827 0.633383 0.008 0.99392
studytime2 -0.370854 0.408066 -0.909 0.36416
studytime3 0.885870 0.529945 1.672 0.09561 .
studytime4 1.544417 0.771251 2.002 0.04611 *
failures -0.725731 0.238263 -3.046 0.00252 **
schoolsupyes -2.409775 0.479809 -5.022 8.65e-07 ***
paidyes -0.639157 0.351732 -1.817 0.07016 .
higheryes 1.157629 0.763977 1.515 0.13073
absences -0.107179 0.027130 -3.951 9.67e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.965 on 309 degrees of freedom
Multiple R-squared: 0.2322, Adjusted R-squared: 0.1999
F-statistic: 7.188 on 13 and 309 DF, p-value: 2.921e-12
The result of the stepwise regression is that we keep
address, mjob, studytime,
failures, schoolsup, paid,
higher and absences as predictors for the
final grade. These variables are significant with the following
significance: * \(\alpha=0.001\):
schoolsup and absences * \(\alpha=0.01\): failures *
\(\alpha=0.05\):
studytime4, mjobhealth and
address Conclusion is that mjob (apart from
mjobhealth), studytime (apart from studytime4),
paid and higher are not significant at \(\alpha=0.05\). The variables that are
dropped from the regression are age, medu and
fedu.
When comparing both models, the stepwise model provides a lower
adjusted r-squared than of the original regression model, of 0.0001.
Conclusion that can be drawn is that more of the variance is explained
with Multiple Linear Regression, however this difference is minimal. The
stepwise regression explains the dependent variable grade
to almost the same extent with less variables and thus should be the
preferred model in this case.
The final model will be defined as above:
Start: AIC=722.47
grade ~ (age + address + medu + fedu + mjob + studytime + failures +
schoolsup + paid + higher + absences)
Df Sum of Sq RSS AIC
- medu 3 15.639 2671.1 718.36
- fedu 3 17.421 2672.8 718.58
- age 1 3.709 2659.1 720.92
- higher 1 14.258 2669.7 722.20
<none> 2655.4 722.47
- mjob 4 79.455 2734.9 723.99
- paid 1 29.852 2685.3 724.08
- address 1 32.094 2687.5 724.35
- failures 1 51.349 2706.8 726.65
- studytime 3 103.019 2758.4 728.76
- absences 1 124.356 2779.8 735.25
- schoolsup 1 225.729 2881.2 746.82
Step: AIC=718.36
grade ~ age + address + fedu + mjob + studytime + failures +
schoolsup + paid + higher + absences
Df Sum of Sq RSS AIC
- age 1 3.709 2674.8 716.81
- fedu 3 38.809 2709.9 717.02
- higher 1 12.347 2683.4 717.85
<none> 2671.1 718.36
- paid 1 26.866 2697.9 719.60
- mjob 4 83.576 2754.6 720.31
- address 1 36.696 2707.8 720.77
+ medu 3 15.639 2655.4 722.47
- failures 1 57.641 2728.7 723.26
- studytime 3 112.081 2783.1 725.64
- absences 1 116.670 2787.7 730.17
- schoolsup 1 227.021 2898.1 742.71
Step: AIC=716.81
grade ~ address + fedu + mjob + studytime + failures + schoolsup +
paid + higher + absences
Df Sum of Sq RSS AIC
- fedu 3 42.519 2717.3 715.91
- higher 1 15.934 2690.7 716.73
<none> 2674.8 716.81
- paid 1 26.853 2701.6 718.04
+ age 1 3.709 2671.1 718.36
- mjob 4 84.726 2759.5 718.88
- address 1 40.381 2715.2 719.65
+ medu 3 15.640 2659.1 720.92
- failures 1 62.605 2737.4 722.28
- studytime 3 112.296 2787.1 724.10
- absences 1 135.649 2810.4 730.79
- schoolsup 1 227.262 2902.0 741.15
Step: AIC=715.91
grade ~ address + mjob + studytime + failures + schoolsup + paid +
higher + absences
Df Sum of Sq RSS AIC
<none> 2717.3 715.91
- higher 1 20.191 2737.5 716.30
+ fedu 3 42.519 2674.8 716.81
+ age 1 7.420 2709.9 717.02
+ medu 3 38.665 2678.6 717.28
- paid 1 29.038 2746.3 717.34
- address 1 42.008 2759.3 718.86
- mjob 4 108.092 2825.4 720.51
- studytime 3 105.564 2822.9 722.22
- failures 1 81.586 2798.9 723.46
- absences 1 137.244 2854.5 729.82
- schoolsup 1 221.817 2939.1 739.25
Call:
lm(formula = grade ~ address + mjob + studytime + failures +
schoolsup + paid + higher + absences, data = train_set)
Residuals:
Min 1Q Median 3Q Max
-6.7806 -2.1537 0.0297* 1.9296 7.5318
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.056469 0.864954 12.783 < 2e-16 ***
addressU 0.909050 0.415923 2.186 0.02959 *
mjobhealth 1.436078 0.709903 2.023 0.04394 *
mjobother -0.342567 0.512123 -0.669 0.50405
mjobservices 0.789537 0.541137 1.459 0.14557
mjobteacher 0.004827 0.633383 0.008 0.99392
studytime2 -0.370854 0.408066 -0.909 0.36416
studytime3 0.885870 0.529945 1.672 0.09561 .
studytime4 1.544417 0.771251 2.002 0.04611 *
failures -0.725731 0.238263 -3.046 0.00252 **
schoolsupyes -2.409775 0.479809 -5.022 8.65e-07 ***
paidyes -0.639157 0.351732 -1.817 0.07016 .
higheryes 1.157629 0.763977 1.515 0.13073
absences -0.107179 0.027130 -3.951 9.67e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.965 on 309 degrees of freedom
Multiple R-squared: 0.2322, Adjusted R-squared: 0.1999
F-statistic: 7.188 on 13 and 309 DF, p-value: 2.921e-12
The final grade=11.056469 + 0.909050 * addressU + 1.436078 * mjobhealth - 0.342567 * mjobother + 0.789537 * mjobservices - 0.004827 * mjobteacher - 0.370854 * studytime2 + 0.885870 * studytime3 + 1.544417 * studytime4 - 0.725731 * failures - 2.409775 * schoolsupyes - 0.639157 * paidyes + 1.157629 * higheryes - 0.107179 * absences.
To implement the model, certain assumptions have to be verified. The assumptions in question are: linearity, independence, normality, constant variance. These assumptions will be checked by inspecting the following plots:
Upon analyzing the residuals versus fits plot (upper-left plot), we found no evident patterns such as curvature or systematic nonuniformity in the vertical spread of points. Therefore, we confirm that the linearity and constant variance assumptions hold true for this example. As you can see in the Normal Q-Q plot, most of the points in the normal probability plot do line up on a straight line, so our normality assumption is accepted For this dataset, the independence assumption means that the grade of one student should not be dependent on the grades of other students. Each student’s grade should be independent of the grades of their peers. Therefore the independence assumption is satisfied.
Here, we want to classify what a students’ grade is, based on the
earlier listed set of predictors. So, we produce a decision tree based
on the CART algorithm. Here, we use the rpart() function from the
rpart package:
tree_cartG = rpart( formula = formula_significant, data = train_set, method = "anova" )
print( tree_cartG ) n= 323
node), split, n, deviance, yval
* denotes terminal node
1) root 323 3539.04000 11.690400
2) schoolsup=yes 46 297.32610 9.717391
4) fedu=1,4 22 111.86360 8.772727 *
5) fedu=2,3 24 147.83330 10.583330 *
3) schoolsup=no 277 3032.91000 12.018050
6) failures>=0.5 57 542.66670 10.333330
12) absences>=1 38 320.84210 9.631579 *
13) absences< 1 19 165.68420 11.736840 *
7) failures< 0.5 220 2286.54500 12.454550
14) mjob=at_home,other,teacher 145 1463.97200 11.813790
28) absences>=13.5 9 67.55556 8.777778 *
29) absences< 13.5 136 1307.97100 12.014710
58) fedu=1,3 61 645.31150 11.245900
116) age>=16.5 34 324.50000 10.500000
232) absences>=0.5 24 158.95830 9.708333 *
233) absences< 0.5 10 114.40000 12.400000 *
117) age< 16.5 27 278.07410 12.185190 *
59) fedu=2,4 75 597.28000 12.640000 *
15) mjob=health,services 75 647.94670 13.693330
30) absences>=7.5 19 138.94740 11.947370 *
31) absences< 7.5 56 431.42860 14.285710
62) medu=1,2 14 35.21429 12.642860 *
63) medu=3,4 42 345.83330 14.833330
126) studytime=1,2 31 265.41940 14.225810
252) age>=15.5 20 168.20000 13.300000 *
253) age< 15.5 11 48.90909 15.909090 *
127) studytime=3,4 11 36.72727 16.545450 *
This decision tree can be plotted using the rpart.plot()
function from the rpart.plot package:
The tree above has 13 decision nodes and 15 leaves. The CART algorithm has selected 8 predictors (schoolsup, failures, mjob, age, fedu, absences, medu and studytime) as important predictors to predict the mean grade. Where, the variable schoolsup is the most important predictor on predicting the mean grade, since it is the root node of the tree. Within the algorithm there can 15 predictions be found of the mean grade (15 leaves).
With the random forest algorithm, there will be made a series of decision trees and these will be combined into one final classification. This will be done with all the significant variables, we run the random forest algorithm, using the randomForest() function from the randomForest package:
random_forest_grade <- randomForest(formula_significant, data=train_set, importance=TRUE, proximity=TRUE)The importance of the variables can be visualized by inspecting the
dot-chart as as measured by the random forest algorithm. Importance,
best predictors, is listed in this order schoolsup
failures, absences, age,
paid mjob, studytime,
higher, medu, fedu and
address.
%IncMSE IncNodePurity
age 6.620826 303.96570
address 1.136158 102.42985
medu 4.770631 272.40380
fedu 3.981114 294.47902
mjob 5.594522 342.42802
studytime 5.445337 304.44290
failures 10.742009 211.11192
schoolsup 22.201919 201.75613
paid 6.583295 125.52535
higher 5.083429 33.14082
absences 9.283945 521.12491
To evaluate the models that have been used in this research, there are two parameters that will be inspected: mean squares of error (MSE) and adjusted R-squared.
[1] 6.975618
The MSE = 6.975618, this means there is an error rate of 6.83119. As mentioned earlier, the adjusted R-squared of the linear regression model is 0.1999, which means that 19,99% of the variance in the mean final grade can be explained by the predictors.
[1] 7.850942
[1] 0.1323372
The MSE = 7.850942, this means there is an error rate of 7.850942. The R-squared of the cart algorithm is 0.1323372, indicating that 13,23% of the variance in the the mean final grade can be explained by the predictors.
Call:
randomForest(formula = formula_significant, data = train_set, importance = TRUE, proximity = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 9.268307
% Var explained: 15.41
The MSE = 9.268307, this means there is an error rate of 9.268307. The variance in the mean final grade explained by the predictors, R-squared of the random forest algorithm, is 15,41%.
In studies the ‘better-performed model’ has a lower mean squares of
error and a higher adjusted R-squared. However the difference between
the ‘better-performed model’, Multiple linear regression, and the
Stepwise Regression is a minimal difference of 0.0001 in the higher
adjusted R-squared, but the stepwise regression needs less variables to
predict the dependent variable grade. With this
normalization, the chosen model is the Stepwise Regression. grade = The
final grade= 11.056469 + 0.909050 * addressU + 1.436078 * mjobhealth -
0.342567 * mjobother + 0.789537 * mjobservices - 0.004827 * mjobteacher
- 0.370854 * studytime2 + 0.885870 * studytime3 + 1.544417 * studytime4
- 0.725731 * failures - 2.409775 * schoolsupyes - 0.639157 * paidyes +
1.157629 * higheryes - 0.107179 * absences.
Given the model, we can conclude that living in a urban environment, a mother with a job in health care or civil services, weekly study time of 5 hours or more and ambition to take higher education have a positive impact on their grade, while a mother with a job as other or teacher, studying 2 to 5 hours per week, number of past class failures, extra educational school support, extra paid classes and number of school absences have a negative impact on their grade.
The most important predictors in this model are the following:
failures, schoolsup and absences.
For every number of past class failures, the grade can decrease with
0.725731. For extra educational school support, the negative effect of
receiving this extra support on their grade is 2.409775. For school
absences, every absence can have a negative effect of 0.107179 on their
final grade.
The hypotheses that have been accepted are the following: H2b (positive relationship between parents with higher-educated occupations and the grade), H2c (negative relationship between receiving extra educational school support and the grade), H2f (positive relationship between living in a urban area and the grade), (partly) H2k (positive relationship between weekly studytime and the grade), H2l (negative relationship between number of school absences and the grade) and H2m (negative relationship between number of past class failures and the grade), these hypotheses are coherent with our regression model.
Our results do not align with all stated hypotheses, indicating a discrepancy between the expected outcomes based on our hypotheses and the actual findings from our study. The statistically insignificant ‘mother and father education’, famsup, paid, internet, sex, age and higher at \(\alpha=0.05\) respectively contradict H2a (positive relationship between parents with higher-education and the grade), H2d (positive relationship between receiving family educational school support and the grade), H2e (positive relationship between following extra paid classes and the grade), H2g (positive relationship between having access to internet and the grade), H2h (negative relationship between sex male and the grade), H2i (negative relationship between age and the grade), H2j (positive relationship between ambition to higher education and the grade). We have rejected these hypothesis.
The research question under consideration focused on the impact of various socioeconomic factors on the academic performance of high school students in Portugal. Recognizing the influence of socioeconomic factors on academic achievement is crucial for devising specific policies to narrow the divide between advantaged and disadvantaged students. By tackling these discrepancies, educational systems can strive for equal opportunities, promoting a fairer and more equitable society for all. Therefore, this research considered the following research question: To what extent do socioeconomic factors, including demographic characteristics, educational context, environmental conditions, parental education and occupation, and educational and social support , impact the academic performance of high school students in Portugal? To answer this question, we first examined the overall academic performance of high school students in taking the Mathematics course Portugal and then categorized the predictors into five socioeconomic determinants: demographic characteristics, educational context, environmental conditions, parental education and occupation, and educational and social support. We subsequently conducted a detailed exploration of each variable in our Explanatory Data Analysis (EDA).
Our findings revealed that a majority of high school students, that
is 76.2%, successfully passed their Mathematics course, while 23.8% did
not meet the passing requirements. During our Explanatory Data Analysis
(EDA), we observed that among the demographic factors, only the variable
age showed significance. However, educational and
environmental factors such as studytime,
failures, absences, higher, and
address, all played a substantial role in influencing
academic performance. Additionally, parental education emerged as a
significant factor affecting students’ grades. When it comes to
educational and social support, we found that only the variable
schoolsup had a significant impact on academic
performance.
In summary, our results suggest that the following variables play a
significant role in determining students’ grades: age,
studytime, failures, attendance,
higher, internet, address,
parental education (measured by medu and
fedu), fathers occupation (measured by mjob),
and schoolsup. Consequently, we accepted hypotheses
H2b, H2c, H2f,
H2k, H2l, and H2m,
and rejected hypothesis H2a, H2d,
H2e, H2g, H2h,
H2i, H2j. These findings underscore
the complex nature of academic performance, where numerous variables
influence the circumstances in which students live and study, ultimately
affecting their grades.
Conducting further research can help identify additional factors that might have been overlooked in this research. Factors such as cultural influences, community support, and individual resilience could play significant roles but might not have been adequately explored. By delving deeper into these aspects, researchers can develop a more comprehensive understanding of the complex nature of socioeconomic influences on academic performance.
Amponsah, K. D., Aboagye, G. K., Narh-Kert, M., Commey-Mintah, P., & Boateng, F. K. (2022). The Impact of Internet Usage on Students’ Success in Selected Senior High Schools in Cape Coast metropolis, Ghana. European Journal of Educational Sciences, 9(2), 1–18. https://doi.org/10.19044/ejes.v9no2a1
Boztas, S. (2023, July 7). Education or “selecting the new elite”? Criticism of primary school testing - DutchNews.nl. DutchNews.nl. https://www.dutchnews.nl/2023/04/education-or-selecting-the-new-elite-criticism-of-primary-school-testing/
Diris, L. B. R. (2022). Ongelijkheid in het Nederlandse onderwijs door de jaren heen. ESB. https://esb.nu/ongelijkheid-in-het-nederlandse-onderwijs-door-de-jaren-heen/
European Commission. (2023, September 14). Education in Portugal. Retrieved October 1, 2023, from https://eurydice.eacea.ec.europa.eu/national-education-systems/portugal/overview
Hayes, S. (2014). Where smart people live: What your chosen home says about your intelligence. Bustle. https://www.bustle.com/articles/30626-where-smart-people-live-what-your-chosen-home-says-about-your-intelligence#
Idris, M., Hussain, S., & Ahmad, N. (2020). Relationship between Parents’ Education and their children’s Academic Achievement. Journal of Arts and Social Sciences, 7(2), 82–92. https://doi.org/10.46662/jass-vol7-iss2-2020(82-92
Ministerie van Onderwijs, Cultuur en Wetenschap. (2023, January 12). Later selecteren, beter differentiëren. Advies | Onderwijsraad. https://www.onderwijsraad.nl/publicaties/adviezen/2021/04/15/later-selecteren-beter-differentieren
Owuor, C. A., Simatwa, E. M. W., & Ndolo, M. A. (2022). Influence of Parental Occupation on Student Academic Performance in Public Secondary Schools in Kenya: A Case Study of Homa Bay Sub County. International Journal of Current Research, 14(02), 20878-20885.
Sara. (2023, June 19). 6 Common Types Of Students To Enrol In Tutoring | Oxford Learning. Oxford Learning. https://www.oxfordlearning.com/types-of-tutoring-students/
The Annie E. Casey Foundation. (2022, December 16). Parental involvement in your child’s education. https://www.aecf.org/blog/parental-involvement-is-key-to-student-success-research-shows#
Leon, C. (2018). The Relation between Absences and Grades: A Statistical Analysis Munich Personal RePEc Archive. https://mpra.ub.uni-muenchen.de/84655/
Ullah, R., & Ullah, H. (2019). Boys versus girls’ educational performance: Empirical evidences from global north and global south. African Educational Research Journal, 7(4), 163–167. https://doi.org/10.30918/aerj.74.19.036